Abstract. The SINTAGMA information integration system is an infrastructure
for accessing several different information sources together. Besides
providing a uniform interface to the information sources (databases,
web services, web sites, RDF resources, XML files), semantic integration
is also needed. Semantic integration is carried out by providing a
high-level model and the mappings to the models of the sources. When
executing a query of the high-level model, the query is transformed to a
low-level query plan, which is a piece of Prolog code that answers the
high-level query. This transformation is done in two phases. First, the
Query Planner produces a plan as a logic formula expressing the low-level
query. Next, the Query Optimizer transforms this formula to executable
Prolog code and optimizes it according to structural and statistical
information about the information sources.

This article discusses the main ideas of the optimization algorithm and 
its implementation. 



1 Introduction 

Integration of heterogeneous information sources requires building an infras- 
tructure for accessing several different information sources together. One task 
of the integration is to provide a uniform interface to the different information 
sources (databases, directory servers, web services, web sites, XML files). The 
other task is the semantic integration, as the meaning of the stored data can 
also be different in the different sources. 

In the SINTAGMA system, the successor of the SILK [1] system, semantic
integration is carried out by building a high-level model and the mappings
between the high-level model and the models of the information sources.
When executing a query of the high-level model, the query has to be
transformed to queries of the sources and to the code performing the
semantic transformation of the data. The component of the SINTAGMA system
responsible for planning and executing queries is the Mediator. The
subcomponents of the Mediator that translate a high-level query to a
low-level query are the Query Planner and the Query Optimizer.



The output of the Query Planner is a Prolog predicate body. This Prolog
code requires call reordering to be executable: while some sources can be
called with arbitrary argument instantiations (e.g. predicates representing
SQL tables), some predicates can be called only when a certain subset of
their arguments is instantiated (predicates representing web services, etc.).

The main and compulsory goal of the optimization step is to make the query
executable. The available modes of the predicates are given, and with this
information at hand it is decidable whether a sequence of goals is actually
callable. The secondary goal of the optimizer is to lower the total cost of
the query, which is essentially its estimated execution time. For this,
some statistical information is available on the average execution time of
the predicates, and also on the number of their solutions.

With this information at hand, the optimizer not only rearranges the order
of goals in the disjunctive branches, but also performs other manipulations
on the code in the hope of obtaining a piece of code (i.e. a query) with
better performance characteristics.

Compared to the SILK system, one of the main features of SINTAGMA is
the Query Optimizer. SILK did not have this component, and query planning
usually needed manual tuning of the resulting plan. Another important new
feature of SINTAGMA is the ability to use negation and aggregation in the
queries and in the model mappings. These new features are presented for the
first time in this article.

Section [2] introduces the basic concepts used by the Query Planner and
Optimizer. Section [3] introduces the main ideas of optimizing and discusses
the optimizer algorithm. Section [4] discusses the execution time issues of
the implementation. Section [5] compares the Query Optimizer of SINTAGMA to
other systems, and Section [6] concludes the paper.

2 Preliminaries 

In the Query Planner and Optimizer, the queries are built from predicates 
using the symbols of conjunction, disjunction, negation and aggregation. 

2.1 Predicates 

Predicates, like in Prolog, represent a relation among the arguments of the
predicate. If a predicate is called with a subset of its arguments
instantiated, the predicate tells the values of the remaining arguments (by
instantiating them) which satisfy the relation. A tuple of values in the
relation is called a solution of the predicate. When a predicate is called,
it can answer with one or more solutions, or with no solutions (failure).
Note that, in contrast with Prolog, the predicates of our framework always
instantiate all their arguments, and an argument is either ground or
uninstantiated; there are no partially instantiated terms.

We distinguish between two kinds of predicates:

Source predicates: These predicates represent the data sources, e.g. tables
of relational databases, or methods of web services. Arguments of the
relation correspond to the columns of a database table, or the (input and
output) arguments of a web service method, etc.

Constraint predicates: These predicates represent relations among their
arguments, which are described by a known algorithm. Such predicates are
usually implemented in Prolog. When such a predicate is called, the
Mediator does not call some external entity, but answers the predicate call
by executing the algorithm. Note that these constraints are not necessarily
constraints of a CLP constraint system; they are just Prolog predicates
satisfying special requirements, as discussed later.

2.2 Query Plans 

The output of the Query Planner is a query described by the following
grammar:

    Query ::= Query, Query
            | Query; Query
            | not(Query)
            | aggregate(GroupVariables, SetExpressions, Query)
            | SourcePredicate
            | ConstraintPredicate



Queries use a notation similar to that of Prolog: comma (,) denotes
conjunction, semicolon (;) disjunction and not negation. Aggregation is
explained in Section 2.6.

The result of executing a query is a set of solutions, and a solution is a 
mapping, which assigns values to the variables of the query. 

2.3 I/O modes of constraint and source predicates 

Just like in Prolog, some of our predicates require a subset of their
arguments to be instantiated at the time of their call. The Mediator has to
know the allowed I/O modes of the predicates. The Query Optimizer produces
query plans which respect the I/O modes of the involved predicates, and
rejects a query if no such query plan can be made.

The I/O mode of a predicate is a mapping which assigns in or out to the
argument positions of a predicate. The meaning of these is similar to the
modes of the same name in Mercury, a purely declarative logic programming
language [2]. If a mode has an in for an argument position, the argument
must be ground when calling the predicate; an out means that the argument
may be uninstantiated. A predicate can have several modes. If a predicate
has more than one mode, then when calling the predicate, the instantiation
state of its arguments must be compatible with at least one of its modes.
If a predicate has only one mode, we often use input and output as
adjectives for describing arguments.
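The compatibility rule for I/O modes can be illustrated with a few lines of
Python. This is only a sketch and not part of SINTAGMA: the function names
compatible and callable_with, the boolean encoding of instantiation states
and the two-mode example predicate are assumptions made for the illustration.

```python
# Illustrative sketch only; all names and the example modes are assumptions.

def compatible(mode, instantiated):
    """True if every 'in' position of the mode is instantiated.

    mode         -- tuple of 'in' / 'out', one entry per argument position
    instantiated -- tuple of booleans, True where the argument is ground
    """
    return all(is_ground for m, is_ground in zip(mode, instantiated) if m == "in")


def callable_with(modes, instantiated):
    """A predicate is callable if its arguments fit at least one of its modes."""
    return any(compatible(mode, instantiated) for mode in modes)


# Hypothetical predicate with two modes: (in, out) and (out, in).
modes = [("in", "out"), ("out", "in")]
print(callable_with(modes, (True, False)))   # the first mode fits
print(callable_with(modes, (False, False)))  # no mode fits
```

With several modes, the predicate is callable as soon as one mode has all of
its in positions covered; the out positions impose no requirement.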

Constraint predicates behave differently from source predicates. Constraint
predicates also have I/O modes like source predicates, but while source
predicates have to be called with arguments compatible with their modes,
constraint predicates can be called at any time, independently of the
instantiation of their arguments. When a constraint predicate is called, it
checks its arguments and, depending on their state, does the following:

1. If the state of the arguments of the predicate is compatible with one of its 
modes, it instantiates its uninstantiated arguments and finishes its operation. 

2. If not, it "falls asleep", and lets the query plan continue running. While 
letting the query plan run, it waits for its arguments to be instantiated. 
When this happens, it goes back to step 1. 
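At planning time, the effect of these two steps can be modelled as a
fixed-point computation over the set of instantiated variables. The Python
sketch below is a hypothetical illustration: the name wake_constraints is
borrowed from Figure 1, but the body, the data layout and the plus/times
example constraints are our assumptions, and optional input arguments are
ignored.

```python
# Illustrative sketch only; data layout and example constraints are assumptions.

def wake_constraints(bound, sleeping):
    """Wake every sleeping constraint whose arguments fit one of its modes.

    bound    -- iterable of variable names that are already instantiated
    sleeping -- list of (name, modes, args) triples; modes is a list of
                tuples of 'in' / 'out', args is a tuple of variable names
    Returns the enlarged set of instantiated variables and the constraints
    that are still asleep.
    """
    bound = set(bound)
    sleeping = list(sleeping)
    progress = True
    while progress:                      # repeat until nothing more wakes up
        progress = False
        for constraint in list(sleeping):
            _name, modes, args = constraint
            fits = any(
                all(a in bound for m, a in zip(mode, args) if m == "in")
                for mode in modes
            )
            if fits:
                bound.update(args)       # a finished constraint binds all its arguments
                sleeping.remove(constraint)
                progress = True
    return bound, sleeping


# Hypothetical constraints: plus(X, Y, Z) and times(Z, W, V).
plus = ("plus", [("in", "in", "out"), ("in", "out", "in"), ("out", "in", "in")],
        ("X", "Y", "Z"))
times = ("times", [("in", "in", "out")], ("Z", "W", "V"))

bound, still_asleep = wake_constraints({"X", "Y", "W"}, [times, plus])
print(sorted(bound), still_asleep)
```

In this example times cannot finish until plus has produced Z, so a single
pass over the sleeping constraints is not enough; the loop repeats until no
further constraint wakes up.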

A constraint predicate always finishes its operation when the state of its
arguments becomes compatible with one of its modes, but in certain cases
(depending on the particular predicate), it can finish earlier. For example,
the predicates representing the relations A ≥ B and A ≤ B each finish their
operation only when both A and B become instantiated, but in the presence of
each other, they can finish their operation as soon as one of A and B
becomes known.¹

A constraint predicate instantiates its uninstantiated arguments when it fin- 
ishes its operation, but is allowed to instantiate some of its arguments earlier. 
There is no currently implemented constraint predicate which would behave like 
this, but it is worth mentioning that this behaviour is supported by the Query 
Optimizer. 

2.4 Optional input arguments 

When creating well-moded query plans, the Query Optimizer has to assume 
that when source predicates are called and when constraints complete, they 
instantiate all their uninstantiated arguments. During the development of the 
Mediator, there was an increasing demand for a more flexible handling of pred- 
icates, namely, for having optional input arguments. 

An optional input argument (optin in the following) is an argument of the 
predicate, but not part of the relation, rather a parameter for the relation. An ar- 
gument of a predicate is an optin argument if the predicate does not instantiate 
it when the argument is uninstantiated at the time of call. 

As an example of an optional input argument, let us examine a possible
information source, a web service implementing a search engine. The source
has three arguments. The first argument is input: the source expects the
words to search for in this argument. The third argument is output: the
source enumerates the addresses of those documents that contain the given
words. The second argument is optional input: it can specify the file type
(.pdf, .ps, .doc, etc.), but it is not mandatory. If it is instantiated, the
source answers only with documents of the given type. If not, the source
answers with addresses of files of the default type, for example .html.

¹ If these constraints are implemented with CLP(R), the two daemons
associated with the constraints quit immediately (after unifying the
variables), but for the Query Optimizer it is the instantiation state of the
constraint arguments which is important, not the presence of daemons, and
the variables are instantiated only after one of them becomes known.

In this example we can note that this argument is not part of the relation,
as:

- If not instantiated at the time of call, the predicate does not
  instantiate it.

- If not instantiated, the answers are not of all the possible file types.

When dealing with optional input arguments, we have to face a problem: 
The query plan is invalid if a predicate is called with an uninstantiated optional 
input argument, but the argument variable is instantiated later. This is because 
the optional input argument of the predicate becomes known, but the predicate 
did not operate according to its value. 

The rule that describes the correct treatment of optional input arguments is 
the following: A predicate is callable if its uninstantiated optional input argu- 
ments will not be instantiated at a later point of the query. The query planner 
has to make query plans that respect this rule and has to reject a query if no 
such plan can be made. 

Let us examine the following example with two sources:

- a(in,out) (the first argument of the source has to be instantiated
  (input))

- b(optin,out) (the first argument is optional input)

The plan a(1,X), b(X,Y) is well-moded, because the optional input argument
X is instantiated at the time of calling b(X,Y).

The plan b(X,Y), a(Y,_) is well-moded, because although the optional input
argument X is not instantiated at the time of calling b(X,Y), it remains
uninstantiated.

The query b(X,Y), a(Y,X) is not acceptable, because:

- the plan a(Y,X), b(X,Y) is ill-moded, because a(Y,X) is called with Y
  uninstantiated;

- the plan b(X,Y), a(Y,X) is ill-moded, because when calling b(X,Y), X is
  not instantiated, but it is instantiated later, by the call a(Y,X).
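The rule for optional input arguments can be checked mechanically. The
following Python sketch replays the example for a fixed goal order; it is an
illustration under simplifying assumptions (one mode per source, plans
encoded as lists of (name, arguments) pairs, variables written as
capitalised strings), and the names MODES, is_var and well_moded are
hypothetical.

```python
# Illustrative sketch only; the plan encoding and all names are assumptions.

MODES = {"a": ("in", "out"), "b": ("optin", "out")}


def is_var(term):
    """Variables are strings starting with an uppercase letter or '_'."""
    return isinstance(term, str) and (term[0].isupper() or term[0] == "_")


def well_moded(plan, modes=MODES):
    """Check a fixed goal order against the in / out / optin rules."""
    bound = set()    # variables instantiated so far
    frozen = set()   # passed unbound as optin: must never be bound later
    for name, args in plan:
        for mode, arg in zip(modes[name], args):
            unbound = is_var(arg) and arg not in bound
            if mode == "in" and unbound:
                return False             # input argument not instantiated
            if mode == "optin" and unbound and arg != "_":
                frozen.add(arg)
        for mode, arg in zip(modes[name], args):
            if mode == "out" and is_var(arg) and arg != "_":
                if arg in frozen:
                    return False         # optin variable instantiated later
                bound.add(arg)
    return True


print(well_moded([("a", (1, "X")), ("b", ("X", "Y"))]))
print(well_moded([("b", ("X", "Y")), ("a", ("Y", "_"))]))
print(well_moded([("b", ("X", "Y")), ("a", ("Y", "X"))]))
print(well_moded([("a", ("Y", "X")), ("b", ("X", "Y"))]))
```

The four calls correspond to the four plans discussed above: the first two
are accepted and the last two are rejected, one for violating the optional
input rule and one for calling a with an uninstantiated input.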

Regarding optional input arguments, we have to enforce the following rule:
If an argument is an optional input argument in one mode, it must be an
optional input argument in all other modes as well, and the predicate must
not instantiate that argument under any condition. This is required because
at certain points of query planning, it must be known whether a variable
might get instantiated at a later point of the query or not.²

² This is also required because of negations and aggregations, discussed
later.

2.5 Negation



SINTAGMA uses the closed world assumption for handling negation.
Procedurally, this is implemented as negation by failure.

A well-known problem with negation by failure arises when the negation is
called before all its variables are instantiated. In our framework, this is
a problem only if an uninstantiated argument of a negated predicate gets
instantiated later. To avoid this, the Query Optimizer considers a negated
query callable (well-moded) only if its uninstantiated variables will not be
instantiated at a later point of the query.

2.6 Aggregation 

Aggregation is used to partition the solutions of a query ("GROUP BY"
in SQL), and combine the solutions in each partition into a single solution.
During the design of the SINTAGMA system, an important goal was to have a
query language which is at least as expressive as SQL. The Mediator of the
SINTAGMA system allows the aggregation of queries spanning several
information sources, the use of the standard SQL set functions (count, sum,
min, max, etc.), and the extension of the system with custom set functions.

The syntax of aggregation is the following:

    Aggregation     ::= aggregate(GroupVariables, SetExpressions, Query)
    GroupVariables  ::= ListOfVariables
    ListOfVariables ::= [] | [Variable | ListOfVariables]
    SetExpressions  ::= [] | [SetExpression | SetExpressions]
    SetExpression   ::= Variable=SetFunctionName(Argument)
    Argument        ::= PrologTerm



Let us show this construct through an example:

    aggregate([Department], [AvgSal=avg(Salary)],
              (works_at(Employee, Department), salary(Employee, Salary)))

Here, Department is the base of grouping, and AvgSal will be bound to the
average of the Salary values for each group; therefore this query returns
the list of departments and the departmental average salaries. The
semantics of aggregation is the following:

1. Query is executed. 

2. From the solutions of Query, groups are formed. The basis of grouping is 
GroupVariables, which is a list of some of the variables of Query. The 
members of a group are those solutions, for which the values of variables in 
GroupVariables are the same. 

3. We calculate the value of each set function in the SetExpressions list, for 
each of the groups.

4. The aggregated query has one solution for each group. A solution is the
instantiation of the variables in GroupVariables, and the instantiation of 
the variables on the left-hand side of the = symbols in SetExpressions. No 
other variables are instantiated and no sleeping constraints are left behind. 
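The four steps can be mirrored in a short Python sketch. It is only an
illustration of the semantics, not the Mediator's implementation: solutions
are represented as dictionaries, a set expression as a (result variable,
function, argument variable) triple, and the Argument of a set function is
restricted to a single variable. The names aggregate and avg and the sample
data are assumptions.

```python
# Illustrative sketch only; the data representation is an assumption.
from collections import defaultdict


def avg(values):
    """Average of a non-empty list of numbers."""
    return sum(values) / len(values)


def aggregate(group_vars, set_exprs, solutions):
    """Group the solutions of a query and apply set functions per group.

    group_vars -- list of variable names forming the basis of grouping
    set_exprs  -- list of (result_var, set_function, argument_var) triples
    solutions  -- list of dicts mapping variable names to values (step 1)
    """
    groups = defaultdict(list)
    for solution in solutions:                          # step 2: form groups
        key = tuple(solution[v] for v in group_vars)
        groups[key].append(solution)
    results = []
    for key, members in groups.items():
        answer = dict(zip(group_vars, key))             # step 4: group variables
        for result_var, function, argument in set_exprs:
            # step 3: evaluate each set function over the group
            answer[result_var] = function([m[argument] for m in members])
        results.append(answer)                          # one solution per group
    return results


# Hypothetical solutions of (works_at(Employee, Department), salary(Employee, Salary)).
solutions = [
    {"Employee": "ann", "Department": "sales", "Salary": 100},
    {"Employee": "bob", "Department": "sales", "Salary": 200},
    {"Employee": "eve", "Department": "it", "Salary": 300},
]
print(aggregate(["Department"], [("AvgSal", avg, "Salary")], solutions))
```

Note that Employee and Salary do not appear in the answers: only the group
variables and the left-hand sides of the set expressions are instantiated.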

Note that aggregation has a problem similar to that of negation: the query
plan is invalid if one of the variables in the aggregated query is
uninstantiated at the time of calling the aggregation, but gets instantiated
later.

3 Query Optimizer 

Optimizing queries means choosing the most efficient query plan among the 
well-moded ones. 

3.1 Base cases of Optimization 

This subsection summarises the main query optimization techniques used in 
the Query Optimizer of SINTAGMA. 

Reordering conjunctions: The execution of a conjunction means executing
the first member of the conjunction, then executing the remaining part for
each solution of the first member. As a consequence, the order of members
in a conjunction radically affects performance. It is well known that
putting a member with a small expected number of solutions in the first
place leads to a better execution time than putting a member with many
solutions there [3]. This optimization technique is similar to the
techniques used by the query
planners of database engines [4], [5], [6], but while the database has exact
knowledge about the tables (keys, indexes, table sizes, number of different 
values in columns), we have to make do with some statistical data. 

Constraints first: In a query plan, source predicates can be called only in 
places where their arguments are sufficiently instantiated, but constraint 
predicates can be called anywhere. We know when constraint predicates are 
bound to instantiate all their arguments and finish their operation, but they 
can instantiate some of their arguments or fail before that point. For this 
reason, it can be beneficial to call constraint predicates well before their
arguments get instantiated. 

Postponing disjunctions: The part of the query plan after a disjunction is
executed once for each branch of the disjunction, therefore postponing
disjunctions is profitable. On the other hand, the sub-query after a
disjunction can be moved inside the branches of the disjunction, and can be
optimized differently in the different branches, which is also beneficial.
In such cases branching on disjunctions is not postponed.

Delegating constraints to sources: Some information sources can understand
some constraints on their own. For example, an SQL database understands
a "smaller than a given number" constraint. When querying such a source, the
constraints on the variables of the query should be sent to the source so
that it can filter its answers according to the constraints, as this is
cheaper than transferring all the solutions to the Mediator and filtering
them there. In practice, the source-level query sent to the source contains
the source-level equivalents of the constraints sleeping at the time of the
source call.

Grouping source predicates together: Some sources can perform joins on
their own. If some source predicates are linked through a common variable
and refer to the same information source, it is beneficial to send one
compound query instead of querying the source according to the first
predicate and querying it again according to the second for each solution
of the first.

These optimization techniques are simple and the transformations they
suggest are not particularly difficult to implement, but the transformations
can contradict each other. Deciding which ones to use in a given situation
is done by estimating the cost of the resulting queries and choosing the
most promising plan.

3.2 Optimization: the Naive Approach 

For the Query Optimizer of SINTAGMA, the following information about 
the predicates is available during planning: 

- Allowed I/O modes for each predicate

- Expected number of solutions of a predicate for certain I/O modes

- Expected cost (execution time) of a predicate for certain I/O modes

The simplest way of optimizing a query plan is generating all orders of the 
conjunctions, throwing away the ill-moded orders, then calculating the estimated 
cost of each one, and choosing the plan with the smallest estimated cost. 

Generating all the possible orders could be done with a very simple
recursive algorithm, which generates all the permutations of the
conjunctions in the query while recursively generating all the orders of the
members of conjunctions. Generating all the possible orders, filtering out
the ill-moded ones, then calculating their cost and choosing the best would
be a very inefficient way of optimizing. Instead, the Query Optimizer
interleaves these tasks: it generates only well-moded plans, calculates
their cost at the same time, and throws away partially computed plans that
are known to lead to more expensive plans than the best plan found so far.
This branch-and-bound method of finding the best plan is still exponential
in execution time, but no polynomial-time algorithm is expected, as the
problem is NP-hard. Luckily, the size of the plans the Optimizer has to
handle allows us to use a well-implemented exponential algorithm instead of
approximation techniques for finding near-optimal solutions.
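The interleaving of plan generation, mode checking, cost calculation and
pruning can be sketched for a flat conjunction as follows. This Python
fragment is a simplified illustration, not the SINTAGMA implementation: each
goal has a single mode and a fixed cost and number of solutions (in the real
system these depend on the instantiation state), and the goal table with the
names big, small and svc is invented for the example.

```python
# Illustrative sketch only; the goal table and all names are assumptions.

# name -> (mode per argument, argument variables, cost per call, expected solutions)
GOALS = {
    "big":   (("out", "out"), ("X", "Y"), 10.0, 100.0),
    "small": (("out",),       ("X",),      1.0,   2.0),
    "svc":   (("in", "out"),  ("X", "Z"),  5.0,   1.0),
}


def optimize(goals):
    """Return (best_plan, best_cost), or (None, inf) if no well-moded order exists."""
    best = {"cost": float("inf"), "plan": None}

    def search(plan, remaining, bound, cost, calls):
        if cost >= best["cost"]:
            return        # bound: this prefix cannot beat the best plan so far
        if not remaining:
            best["cost"], best["plan"] = cost, plan
            return
        for name in sorted(remaining):
            modes, args, goal_cost, nsol = goals[name]
            if any(m == "in" and a not in bound for m, a in zip(modes, args)):
                continue  # ill-moded at this point: never generated
            # the goal is called once for every solution of the preceding goals
            search(plan + [name], remaining - {name}, bound | set(args),
                   cost + calls * goal_cost, calls * nsol)

    search([], frozenset(goals), frozenset(), 0.0, 1.0)
    return best["plan"], best["cost"]


print(optimize(GOALS))
```

Because costs are non-negative, the cost of a partial plan never decreases
as goals are appended, which is what makes the pruning test at the top of
search safe. For the table above the sketch selects the order small, svc,
big with an estimated cost of 31.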

3.3 The Optimization Algorithm

The optimization of queries is done by a procedure with the following input 
arguments: Query, Continuation, InstVars, Constraints. The output argu- 
ments are OptimizedQuery, ResultVars, Cost and NumSol. The result of op- 
timization (OptimizedQuery) is a conjunction which starts with the optimized 
Query and continues with the optimized Continuation. The procedure is ini- 
tially started with the query to optimize in the Query argument and an empty 
query in Continuation. InstVars is the set of variables that were already instan- 
tiated by the query parts preceding Query and Continuation, and Constraints 
is the set of sleeping constraints. ResultVars is the set of variables which are 
necessarily instantiated by OptimizedQuery. Cost and NumSol are the estimated 
cost and number of solutions of OptimizedQuery. 

With these input and output arguments, optimization can be carried out in
parallel with checking mode correctness and calculating cost. The algorithm
is described by Prolog code fragments. These code pieces give a high-level
view of the algorithm. Some details are left out, for example the
calculation of costs and number of solutions.

The task of the procedure depends on its Query argument. The most difficult
case is when Query is a conjunction. The code fragment for dealing with a
conjunction is shown in Figure 1. If Query is a conjunction or a source
predicate (a conjunction of one), the procedure first appends Continuation
to Query (line 1), resulting in a conjunction of many members. Then, it
chooses all the constraint predicates to fill the first places of the
resulting query (lines 3-7). When there are no more constraint predicates,
it chooses each of the members and recursively optimizes them with the
remaining members as the Continuation. If the chosen member is not a source
predicate, then the optimization is simply a recursive call to optimize
(lines 9-12).

If the chosen member is a source predicate, then the optimizer tries to pack 
it together with other source predicates that can be called in succession and 
refer to the same information source. It does this grouping in all possible ways 
(line 14). Source query packs will also include a suitable subset of the sleeping 
constraints in order to be sent to the source as well (line 17). 

Next, let us examine the case of a disjunction (Figure 2). If the query to
be optimized is a disjunction, it has already been decided that the
disjunction will be the first member of a conjunction. This, however, does
not mean that the predicates inside the branches of the disjunction will
precede the predicates in the continuation. It only means that at this
point the query has to fork with a disjunction.

Optimizing each branch of the disjunction together with the continuation,
independently of the other branch, is beneficial for the following reason:
the optimizer may prefer to order the members of the branches and the
continuation differently in the two branches, in which case the conjunction
of the optimized disjunction and a single, separately optimized continuation
would be sub-optimal. This way of optimizing disjunctions is why the
Continuation argument is needed.

     1  concat_conjunction(Query, Continuation, WholeQuery),
     2  (
     3      select_from_conjunction(Element, WholeQuery, Rest),
     4      is_constraint(Element),
     5      is_callable(Element)
     6  ->
     7      optimize(Element, Rest, InstVars, Constraints, OptimizedQuery, ResultVars)
     8  ;
     9      select_from_conjunction(Element, WholeQuery, Rest),
    10      not_constraint(Element),
    11      not_source_pred(Element),
    12      optimize(Element, Rest, InstVars, Constraints, OptimizedQuery, ResultVars)
    13  ;
    14      select_callable_source_pred_sequence(SourcePreds, WholeQuery, Rest),
    15      % if such cannot be selected, we fail here
    16
    17      create_source_query_pack(SourcePreds, Constraints, SourceQuery),
    18      instantiates_variables(SourcePreds, Vars),
    19      union(InstVars, Vars, InstVars1),
    20      wake_constraints(InstVars1, Constraints, InstVars2, Constraints1),
    21      optimize(Rest, empty, InstVars2, Constraints1, OptRest, ResultVars),
    22      create_conjunction(SourceQuery, OptRest, OptimizedQuery)
    23  )

Fig. 1: Optimizing Conjunctions



There are three more cases: the optimization of negations, aggregations and
constraint predicates. The code fragments for negation and aggregation can
be seen in Figure 3. The most interesting part of these is the check whether
the uninstantiated query variables might be instantiated later. The case of
constraint predicates is left to the reader.

The algorithm described above generates all the possible goal orders and can 
simultaneously calculate their estimated costs. Note that the code has choice- 
points only when dealing with conjunctions. The optimizer procedure succeeds 
at most once, resulting in the best (cheapest) plan, or a failure if no well-moded 
plan can be found. 

    Query = (BranchA ; BranchB),

    concat_conjunction(BranchA, Continuation, WholeQueryA),
    concat_conjunction(BranchB, Continuation, WholeQueryB),

    optimize(WholeQueryA, empty, InstVars, Constraints, OptA, InstVarsA),
    optimize(WholeQueryB, empty, InstVars, Constraints, OptB, InstVarsB),

    intersection(InstVarsA, InstVarsB, ResultVars),

    OptimizedQuery = (OptA ; OptB)

Fig. 2: Optimizing Disjunctions

3.4 Cost estimation

The cost estimation of the optimizer is a field of further research. The exact
parameters of the present algorithm will be refined during the use of the query
planner in production systems.

The present algorithm implements the following ideas:

Constraints: None of the currently used constraints of the SINTAGMA system
have more than one solution. Some of the constraints either succeed or fail,
but most of them implement functions, which means that the number of
solutions is one. The cost and the number of solutions of constraint
predicates are pre-defined constants for each of the predicates.

Negation: The cost and number of solutions of a negated query are calculated
from the cost and number of solutions of the query by a formula which is not
fixed yet. The cost is smaller than the cost of the query, as the query has
to supply only the first solution, not all. The number of solutions of a
negated query is somewhere between 0 and 1 (it either fails or succeeds
once, with some probability).

Aggregation: The cost of an aggregation is the cost of the aggregated query,
plus some cost of collecting the solutions for the aggregation. The number
of solutions (the number of groups) can be approximated by the ratio of the
number of solutions of the aggregated query and the number of solutions it
would have if the GroupVariables were also instantiated.

Disjunction: The cost and number of solutions of a disjunction are the sums
of the costs and numbers of solutions of the two branches.

Conjunction: The number of solutions of a conjunction is the number of
solutions of the first member, multiplied by the (recursively calculated)
number of solutions of the remaining part of the conjunction. The cost of a
conjunction is the cost of the first member, plus the (recursively
calculated) cost of the remaining part of the conjunction multiplied by the
number of solutions of the first member.

Source predicates: The cost and the number of solutions of source predicates
are derived from statistical data. The Mediator continuously collects
statistical data about the execution time and number of solutions of source
predicates for each instantiation state of their arguments. From this data
the query optimizer can estimate the cost and number of solutions of a
source predicate in a query plan, if the statistical database contains
information for the predicate with the same instantiation state of its
arguments. If not, the query optimizer interpolates from the statistical
data of other instantiation states, and may also use other information. For
example, some relational information sources can tell the number of rows in
a table and the number of different values in a column of a table (from
which one can tell if a column is a key).

Source query packs: The number of solutions of a source query pack is the
same as that of calling the source predicates in a conjunction. The formula
to calculate the cost of a source query pack is subject to further research.
There are many circumstances to consider regarding the filtering ability of
the sources:

- if some arguments are bound to constants (this is covered by the
  statistical data)

- if some arguments of different predicates are bound to each other

- if some arguments of a single predicate appear in a constraint

- if some arguments of different predicates appear in a constraint
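The rules for conjunctions and disjunctions translate directly into a
recursive estimator. The Python sketch below covers only these two rules and
plain predicates with fixed statistics; negation, aggregation and source
query packs are left out because their formulas are not fixed in the text.
The tuple encoding of queries, the function name estimate and the sample
statistics are assumptions made for the illustration.

```python
# Illustrative sketch only; query encoding and statistics are assumptions.

def estimate(query, stats):
    """Return (cost, number_of_solutions) of a query.

    query -- ("pred", name), ("and", first, rest) or ("or", left, right)
    stats -- dict mapping predicate names to (cost, number_of_solutions)
    """
    kind = query[0]
    if kind == "pred":
        return stats[query[1]]
    cost_a, n_a = estimate(query[1], stats)
    cost_b, n_b = estimate(query[2], stats)
    if kind == "and":
        # the remaining part is executed once per solution of the first member
        return cost_a + n_a * cost_b, n_a * n_b
    if kind == "or":
        # both branches are executed; costs and solutions add up
        return cost_a + cost_b, n_a + n_b
    raise ValueError(f"unsupported query node: {kind}")


# Hypothetical statistics: p is expensive with many solutions, q is cheap.
stats = {"p": (10.0, 100.0), "q": (1.0, 2.0)}
p, q = ("pred", "p"), ("pred", "q")
print(estimate(("and", p, q), stats))
print(estimate(("and", q, p), stats))
```

Both orders yield the same number of solutions, but calling q first gives a
much lower estimated cost, which is the effect that reordering conjunctions
exploits.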

4 Evaluation 

Although the Mediator caches the planned (and optimized) queries, query
planning time does matter, and extreme planning times are not acceptable.
The runtime of the algorithm described above is exponential in the size of
the input. The proposed branch-and-bound technique dramatically speeds up
the optimization code, but is still slow in some cases. However, there is
an untapped opportunity to further reduce the runtime of the computation:
memoizing the results of optimizing the sub-queries.

Let us examine how the optimizer traverses the space of possible query
plans of the query (a, b, c, ...). First it chooses a as the first goal of
the query, and recursively optimizes (b, c, ...), which involves the
recursive optimization of (c, ...). Next, it chooses b as the first goal of
the query, and recursively optimizes (a, c, ...), which involves the
recursive optimization of (c, ...), and so on. The best plan of (c, ...) is
thus calculated several times. This duplicated work can be eliminated if the
optimizer memoizes the results of optimizing sub-queries.

Memoizing is implemented as a meta-predicate that memoizes the best so- 
lution of a goal (according to an arithmetic expression), and succeeds at most 
once, unifying the goal with its best solution. It also memoizes the result if it 
is a failure or an exception. The meta-predicate also gives the called goal the 
opportunity to read the value of its best previous result, so it can stop traversing 
branches of its search space where no better solution can be found. 
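The effect of memoizing sub-query results can be illustrated with Python's
standard functools.lru_cache. The sketch below is not the meta-predicate
described above: it caches the best plan (or the failure, represented by
None) of each sub-problem, but it does not model exceptions or the reading
of the best previous result for pruning. The goal table for the chain
(a, b, c, d) and the names make_optimizer and best are assumptions.

```python
# Illustrative sketch only; the goal table and all names are assumptions.
from functools import lru_cache


def make_optimizer(goals):
    """Build a memoized optimizer for a flat conjunction of goals.

    goals -- name -> (mode per argument, argument variables, cost, solutions)
    """
    stats = {"evaluated": 0}      # number of sub-problems actually computed

    @lru_cache(maxsize=None)
    def best(remaining, bound):
        """Cheapest well-moded order of `remaining` as (cost, plan), or None."""
        stats["evaluated"] += 1
        if not remaining:
            return 0.0, ()
        result = None             # stays None on failure; failures are cached too
        for name in sorted(remaining):
            modes, args, cost, nsol = goals[name]
            if any(m == "in" and a not in bound for m, a in zip(modes, args)):
                continue          # not callable with the current bindings
            rest = best(remaining - {name}, bound | frozenset(args))
            if rest is None:
                continue
            total = cost + nsol * rest[0]
            if result is None or total < result[0]:
                result = (total, (name,) + rest[1])
        return result

    return best, stats


# Hypothetical conjunction chain (a, b, c, d) without mode restrictions.
GOALS = {
    "a": (("out",), ("A",), 4.0, 3.0),
    "b": (("out",), ("B",), 1.0, 1.0),
    "c": (("out",), ("C",), 2.0, 2.0),
    "d": (("out",), ("D",), 3.0, 5.0),
}
best, stats = make_optimizer(GOALS)
cost, plan = best(frozenset(GOALS), frozenset())
print(cost, plan, stats["evaluated"])
```

For this chain only 16 sub-problems are evaluated, one for each subset of
the four goals, although the four goals have 24 possible orders; the gap
widens quickly as the chain grows.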

The Query Optimizer uses the memoizer for all of its recursive calls. It also
reads the value of its best previous result and uses a simple cost-estimation
method to decide whether producing a better plan is still possible. The
execution time of the algorithm is exponential in the number of sub-queries in
a conjunction, therefore we have chosen conjunction chains to benchmark the
different implementations. Table 1 shows the execution times of the Query
Optimizer. The measurements were made with SICStus Prolog 3.12.5, on a machine
with an Intel® Pentium® M 2 GHz processor. The results show that the runtime
of the original algorithm is exponential (roughly a tenfold increase with each
added source predicate when the sources differ), and that both the
branch-and-bound and the memoization techniques speed up the algorithm.
However, memoization alone is not effective enough if the query has many
source predicates referring to the same source. This is because, when dealing
with source predicates, the algorithm enumerates all the possible (callable)
subsets of the source predicates in the query. Memoizing cannot help in this
situation, but branch-and-bound does: when both techniques are used, the size
of the queries can be increased without an observable exponential increase in
runtime. The last row shows the optimization of a real-life query combining
negation, aggregation, disjunction, conjunction, constraint and source
predicates.
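
The subset enumeration that limits memoization for common-source queries can
be pictured with the following sketch. It is a simplification: sub_list/2 and
candidate_groups/2 are hypothetical names, and the callability check based on
I/O modes is not reproduced, so every non-empty subset is counted.

```prolog
% sub_list(+List, -Sub): Sub contains some elements of List, in the same order.
sub_list([], []).
sub_list([X|Xs], [X|Ys]) :- sub_list(Xs, Ys).
sub_list([_|Xs], Ys)     :- sub_list(Xs, Ys).

% candidate_groups(+SourcePreds, -Groups): all non-empty subsets of the
% source predicates (assumption: callability is ignored).
candidate_groups(SourcePreds, Groups) :-
    findall(Group,
            ( sub_list(SourcePreds, Group), Group \== [] ),
            Groups).
```

For n source predicates this yields 2^n - 1 groups, for example 7 groups for
[p,q,r] and 255 groups for eight predicates.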



Query                              Naive algorithm  Branch&Bound  Memoizing    Both
8 source preds, different source             30.62        0.0961     0.1716  0.0128
9 source preds, different source            276.74        0.2016     0.4053  0.0171
10 source preds, different source          2766.28        0.2986     0.9584  0.0222
8 source preds, common source               290.10        0.0696    15.2550  0.0141
9 source preds, common source              5298.83        0.1277   140.2700  0.0196
10 source preds, common source                 N/A        0.2071  1439.8800  0.0262
specimen query                              0.0379        0.0250     0.0069  0.0082



Table 1. Execution times in seconds 



5 Related Work 

The compiler of Mercury, a purely declarative Prolog variant, reorders
predicates according to their I/O modes, as described in [2]. The mode system
of Mercury is much more expressive than the mode system of SINTAGMA's Query
Optimizer: our in and out modes are easily handled by the Mercury compiler. On
the other hand, it does not offer optimizations similar to those of our
optimizer; it only reorders the predicates according to their I/O modes.

The SIMS and the Infomaster information integration systems have a query
optimizer component, as described in [7] and [8]. However, these optimizers
have a different task than ours. In those systems, the query optimizer takes
advantage of semantic knowledge about the information sources to choose, among
the plans that answer the user query, a query plan that needs the least number
of information source accesses. In the Mediator of SINTAGMA, this is the task
of the Query Planner, and the Query Optimizer optimizes only the query
execution plan.

6 Conclusion 

In the SILK information integration system, query plans often needed manual
tuning, especially in the presence of information sources which have I/O mode
restrictions. While SILK had only conjunction and disjunction in the query
plans, queries in SINTAGMA also contain negation and aggregation. With the
growing use of information sources other than relational databases, the need
for manual tuning of the more complex queries has become a major drawback. The
Query Optimizer presented in this article is part of the next release of
SINTAGMA. During the testing of the system, some details, especially the cost
estimation formulas, will be refined. With the use of the Query Optimizer, we
expect that manual tuning of query plans will become unnecessary.

Query=not(QueryInNegation),
% collecting the variables that are not known to be instantiated
collect_variables(QueryInNegation, Vars),
subtract(Vars, InstVars, UninstVars),
% checking whether Continuation or the sleeping constraints
% might instantiate one of these
instantiates_variables(Continuation, ContVars),
disjoint(UninstVars, ContVars),    % if not, we fail here
instantiates_variables(Constraints, ConstVars),
disjoint(UninstVars, ConstVars),   % if not, we fail here
optimize(QueryInNegation, empty, InstVars, Constraints, OptQuery, _Vars),
OptNegation=not(OptQuery),
optimize(Continuation, empty, InstVars, Constraints, OptCont, ResultVars),
create_conjunction(OptNegation, OptCont, OptimizedQuery),


Query=aggregation(GroupVars, SetExprs, QueryInAggregation),
extract_aggregated_vars(SetExprs, AggVars),
% collecting the variables that are not known to be instantiated
% and will not be instantiated by the aggregation
collect_variables(QueryInAggregation, Vars),
subtract(Vars, InstVars, UninstVars),
subtract(UninstVars, GroupVars, FlounderVars),
% checking whether Continuation or the sleeping constraints
% might instantiate one of these
instantiates_variables(Continuation, ContVars),
disjoint(FlounderVars, ContVars),    % if not, we fail here
instantiates_variables(Constraints, ConstVars),
disjoint(FlounderVars, ConstVars),   % if not, we fail here
optimize(QueryInAggregation, empty, InstVars, Constraints, OptQuery, _Vars),
OptAggregation=aggregation(GroupVars, SetExprs, OptQuery),
union(InstVars, GroupVars, InstVars1),
union(InstVars1, AggVars, InstVars2),
wake_constraints(InstVars2, Constraints, InstVars3, Constraints1),
optimize(Continuation, empty, InstVars3, Constraints1, OptCont, ResultVars),
create_conjunction(OptAggregation, OptCont, OptimizedQuery)



Fig. 3: Optimizing Negated and Aggregated Queries 